Introduction

Thyroid cancer can develop from the different cells that form the follicles of the thyroid. This gland located at the base of the throat secretes hormones such as T3 and T4 that have their metabolic functionalities such as control of heart rate, blood pressure, body temperature and weight. See it on TCGA.

From the four types of thyroid cancer (papillary, follicular, medullary, and anaplastic thyroid cancer) Papillary Thyroid Carcinoma (PTC) is the most common type of thyroid cancer (Agrawal et al., 2014). It is more prevalent in women than men and its common diagnosis occurs around 49 years old.

Driver onco-mutations for this type of cancer appear as alterations of the MAPK signaling pathway and the PI3K-AKT pathway (Agrawal et al., 2014). Both are implied in cell proliferation and survival and in human tumorigenesis. The overactivation of the MAPK pathway because of mutations such as the BRAFV600E mutation, leads to the development of PTC from follicular thyroid cells (Xing, 2013). On the other hand, mutations that activate the PI3K-AKT pathway, such as mutations in RAS, PTEN and PIK3CA, mostly leads to development of Follicular Thyroid Adenoma (FTA) and Follicular Thyroid Cancer (FTC) also in follicular thyroid cells (Xing, 2013).

Research from The Cancer Genome Atlas (TCGA) focused on PTC and they found that a part from alterations in BRAF (specifically BRAFV600E) and RAS, there are other driver mutations such the ones in EIF1AX PPM1D, and CHEK2 genes that are main alterations for that type of thyroid cancer(Agrawal et al., 2014).

Data import

We start importing the raw table of counts.

library(SummarizedExperiment)

thca <- readRDS("seTHCA.rds") 
thca
## class: RangedSummarizedExperiment 
## dim: 20115 572 
## metadata(5): experimentData annotation cancerTypeCode
##   cancerTypeDescription objectCreationDate
## assays(1): counts
## rownames(20115): 1 2 ... 102724473 103091865
## rowData names(3): symbol txlen txgc
## colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
##   TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
##   TCGA.KS.A41L.11A.11R.A23N.07
## colData names(549): type bcr_patient_uuid ...
##   lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total

Data exploration

Explore the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata.

dim(colData(thca))
## [1] 572 549
col.data <- colData(thca)
col.data[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
##                                  type                     bcr_patient_uuid
##                              <factor>                             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07    tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
## TCGA.BJ.A0YZ.01A.11R.A10U.07    tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
## TCGA.BJ.A0Z0.01A.11R.A10U.07    tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
## TCGA.BJ.A0Z2.01A.11R.A10U.07    tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
## TCGA.BJ.A0Z3.01A.11R.A13Y.07    tumor 331cae6e-2868-4c58-9302-709a9ff7d025
##                              bcr_patient_barcode form_completion_date
##                                         <factor>             <factor>
## TCGA.4C.A93U.01A.11R.A39I.07        TCGA-4C-A93U           2014-11-12
## TCGA.BJ.A0YZ.01A.11R.A10U.07        TCGA-BJ-A0YZ            2011-4-11
## TCGA.BJ.A0Z0.01A.11R.A10U.07        TCGA-BJ-A0Z0            2011-4-15
## TCGA.BJ.A0Z2.01A.11R.A10U.07        TCGA-BJ-A0Z2            2011-4-15
## TCGA.BJ.A0Z3.01A.11R.A13Y.07        TCGA-BJ-A0Z3             2011-6-2
##                              prospective_collection
##                                            <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                    YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z0.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z2.01A.11R.A10U.07                     NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     NO
col.data.meta <- mcols(colData(thca), use.names=TRUE)
col.data.meta
## DataFrame with 549 rows and 2 columns
##                                                          labelDescription
##                                                               <character>
## type                                           sample type (tumor/normal)
## bcr_patient_uuid                                         bcr patient uuid
## bcr_patient_barcode                                   bcr patient barcode
## form_completion_date                                 form completion date
## prospective_collection            tissue prospective collection indicator
## ...                                                                   ...
## lymph_nodes_pelvic_pos_total                               total pelv lnp
## lymph_nodes_aortic_examined_count                           total aor lnr
## lymph_nodes_aortic_pos_by_he                          aln pos light micro
## lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
## lymph_nodes_aortic_pos_total                                total aor-lnp
##                                         CDEID
##                                   <character>
## type                                       NA
## bcr_patient_uuid                           NA
## bcr_patient_barcode                   2673794
## form_completion_date                       NA
## prospective_collection                3088492
## ...                                       ...
## lymph_nodes_pelvic_pos_total          3151828
## lymph_nodes_aortic_examined_count     3104460
## lymph_nodes_aortic_pos_by_he          3151832
## lymph_nodes_aortic_pos_by_ihc         3151831
## lymph_nodes_aortic_pos_total          3151827

Now, explore the row (feature) data.

rowRanges(thca)
## GRanges object with 20115 ranges and 3 metadata columns:
##             seqnames               ranges strand |      symbol     txlen
##                <Rle>            <IRanges>  <Rle> | <character> <integer>
##           1    chr19 [58345178, 58362751]      - |        A1BG      3322
##           2    chr12 [ 9067664,  9116229]      - |         A2M      4844
##           9     chr8 [18170477, 18223689]      + |        NAT1      2280
##          10     chr8 [18391245, 18401218]      + |        NAT2      1322
##          12    chr14 [94592058, 94624646]      + |    SERPINA3      3067
##         ...      ...                  ...    ... .         ...       ...
##   100996331    chr15 [20835372, 21877298]      - |       POTEB      1706
##   101340251    chr17 [40027542, 40027645]      - |    SNORD124       104
##   101340252     chr9 [33934296, 33934376]      - |   SNORD121B        81
##   102724473     chrX [49303669, 49319844]      + |      GAGE10       538
##   103091865    chr21 [39313935, 39314962]      + |   BRWD1-IT2      1028
##                  txgc
##             <numeric>
##           1 0.5644190
##           2 0.4882329
##           9 0.3942982
##          10 0.3895613
##          12 0.5249429
##         ...       ...
##   100996331 0.4308324
##   101340251 0.4903846
##   101340252 0.4074074
##   102724473 0.5055762
##   103091865 0.5924125
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

Looking deeper into col.data with:

col.data[1:5, 30:35]
## DataFrame with 5 rows and 6 columns
##                              lymph_nodes_examined_count
##                                                <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                        YES
## TCGA.BJ.A0YZ.01A.11R.A10U.07            [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                        YES
## TCGA.BJ.A0Z2.01A.11R.A10U.07                         NO
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                        YES
##                              lymph_nodes_examined
##                                          <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   41
## TCGA.BJ.A0YZ.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    5
## TCGA.BJ.A0Z2.01A.11R.A10U.07      [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                    3
##                              lymph_nodes_examined_he_count
##                                                   <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                             1
## TCGA.BJ.A0YZ.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z0.01A.11R.A10U.07                             0
## TCGA.BJ.A0Z2.01A.11R.A10U.07               [Not Available]
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                             0
##                              weiss_score_overall mitoses_per_50_hpf
##                                         <factor>           <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                  NA                 NA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z0.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z2.01A.11R.A10U.07                  NA                 NA
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                  NA                 NA
##                              ajcc_pathologic_tumor_stage
##                                                 <factor>
## TCGA.4C.A93U.01A.11R.A39I.07                   Stage IVA
## TCGA.BJ.A0YZ.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z0.01A.11R.A10U.07                    Stage II
## TCGA.BJ.A0Z2.01A.11R.A10U.07                   Stage IVC
## TCGA.BJ.A0Z3.01A.11R.A13Y.07                     Stage I

We observed that there are different ways to indicate that the data is not available (NA, [Not Available]…). Consequently, we decide to study which variables have more information. To do so, we create two functions, one (na.replace) to change those values into the standarized NA nomenclature, and filter.info to quantify how many of each variable is not NA.

na.replace <- function(x){
    # Remove the unwanted factors
    lev <- levels(x)
    lev <- lev[!(lev %in% "[Not Available]")]
    lev <- lev[!(lev %in% "[Unknown]")]
    lev <- lev[!(lev %in% "[Not Applicable]")]
    lev <- lev[!(lev %in% "[Not Available]|[Not Available]|[Not Available]")]
    lev <- lev[!(lev %in% "[Not Evaluated]")]
    lev <- lev[!(lev %in% "None")]

    x <- factor(x, levels=lev)
}

filter.info <- function(x){
    # Function to filter the NA, [Not Available], [Not Applicable] and [Unknown]
    # Return the proportion of information on x
    nas <- sum(is.na(x))
    return(1 - (nas/length(x)))
}

We apply these functions to the col.data and we visualize it:

meta.info <- as.data.frame(lapply(col.data, na.replace))
variable.info <- sapply(meta.info, filter.info)

# Representing the information by column vs total information
relative.info <- variable.info[order(variable.info)]/sum(variable.info)
plot(relative.info, main="Information brought by column")
\label{fig:col.data.analysis}Figure 1. Information in each column

Figure 1. Information in each column

# Plotting the histogram of the information
hist(variable.info, 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
\label{col.data.analysis2}Figure 2.A Histogram of information of additional information per sample.

Figure 2.A Histogram of information of additional information per sample.

# Seeing the most informative to set a threshold,
# frequency are the number of columns with such % of information
hist(variable.info, xlim = c(0.3, 1), ylim = c(0, 35), 
     main = "Variables completness", xlab = "% of not NA of each variable", ylab = "Counts")
\label{\label{col.data.analysis3}Figure 2.B: A zoom on the most informative variables.

\label{Figure 2.B: A zoom on the most informative variables.

As we can see, in many variables the main value is just NA or other derivatives meaning “No information available for this variable”. Therefore, in order to keep the most informative variables, we apply a threshold of 60% (of the values containing informative data) to select the relevant ones:

filtered <- meta.info[, variable.info  > 0.6]

dim(filtered)
## [1] 572  44
meta.info.summary2 <- col.data.meta[variable.info  > 0.6,]

Associated variables exploration

To explore the data we first create an object with ggplot:

library("ggplot2")
fplot <- ggplot(as.data.frame(filtered))

## ggplot options for axes and background color
# Options of the labels
l <- theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 1))
# Removing the background
b <-  theme(axis.line = element_line(colour = "black"),
            panel.grid.major.x = element_blank(),
            panel.grid.major.y = element_line(colour = "grey", 
                                              size = 0.5, linetype = 2),
            panel.grid.minor.y = element_line(colour = "grey", 
                                              size = 0.25, linetype = 3),
            panel.border = element_blank(),
            panel.background = element_blank())
y.axis <- ylab("number of cases")

And we plot some relationships for some significant variables: In the “normal” type samples most of the information is not available. This could be due to the fact that all “normal” samples were obtained from patients with tumors. So in order to have the maximum information of each sample, we infer “normal” samples information from its paired sample data (mainly gender and ethnicity).

fplot + geom_bar(aes(type, fill = gender)) + b + y.axis + 
    ggtitle("Gender distribution across normal and tumour samples")
\label{gender}Figure 3.A: The gender distribution: threre are more females than males.

Figure 3.A: The gender distribution: threre are more females than males.

As expected, there are more female samples than male samples. As we mentioned before, thyroid cancer is more prevalent in women than men and we can see that not only for tumor samples but for normals samples too. In addition, there are more tumor cases than normal cases.

fplot + geom_bar(aes(type, fill = tumor_status)) + b + y.axis + 
    ggtitle("Tumor status across normal and tumour samples")
\label{status}Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.

Figure 3.B: Some tumoral samples are not clearly assigned a tumor status.

We also have examined the tumor status of the tumor samples. There are more cancer patients without tumor (“tumor free”) than with the tumor (“with tumor”) after the collection of the data. We assume that most of the patients that participate in the study have recovered after it.

fplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b + 
    y.axis + ggtitle("Ethnicity distribution across normal and tumour samples")
\label{ethnicity}Figure 3.C: The predominant ethnicity are not hispanic or latino

Figure 3.C: The predominant ethnicity are not hispanic or latino

In this plot we can see clearly the lack of data for some variables regarding normal samples. In other words, there is no record for normal samples neither females nor males about the hispanic origin of the individuals.

labels <- c("AM. INDIAN OR ALASKA NAT.", "ASIAN", "BLACK OR AFRICAN AM.", 
            "WHITE", "NA")
fplot + facet_wrap(~ ethnicity) + geom_bar(aes(race, fill = gender)) + b +
    l + y.axis + scale_x_discrete(breaks=waiver(), labels = labels) +
    ggtitle("Gender distribution across different ethnicity and race.")
\label{ethnicity2}Figure 3.D: Most samples are from white ancestry.

Figure 3.D: Most samples are from white ancestry.

If we have a look at the ethnicities, gender and hispanic origin, the data shows more cases for “white” individuals “not hispanic or latino” than “white” individuals “hispanic or latino”.

fplot + geom_bar(aes(x=gender, fill = race)) + b + y.axis + 
    ggtitle("Ethnicity distribution in male and female")
\label{race}Figure 3.E: Both in males and females the most abundant race is white.

Figure 3.E: Both in males and females the most abundant race is white.

If we ignore the hispanic origin of the individuals and just focus on the main ethnicities of the study we can observe that for both female and male samples there is a majority of “white” individuals. However, the fact that “white” subjects are hispanic or not can provide some variability so that is something we will take into account later.

fplot + facet_wrap(~ type ) + geom_bar(aes(tumor_focality, fill = gender)) + b +
    l + y.axis + xlab("tumor focality") + 
    ggtitle("Gender distribution in different tumour focality")
\label{focality}Figure 3.F: There are some tumors which we don't know the focality.

Figure 3.F: There are some tumors which we don’t know the focality.

There is no data about tumor focality in the normal samples, but this could be due to the fact that are samples obtained probably during the tumor removal or for a biopsy, from a healty part of the tyroid gland. For this reason there is no reason for the normal samples to have this feature.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
hd.xaxis <-  scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
                labels=labels)
fplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + hd.xaxis + 
    y.axis + 
    ggtitle("Gender distribution across different types of hystologic diagnosis")
\label{diagnosis}Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.

Figure 3.G The most common diagnosticated tumor is Papillary thyroid Carcinoma.

Here, we can see that the tumor samples came mainly from classical Papillary Thyroid Carcinoma. We expect differential expression between the different thyroid cancer phenotypes so this might be a source of variability.

labels <- c("Other specify", "Papillary - Classical", "Papillary - Follicular", 
            "Papillary - Tall Cell", "NA")
ggplot(filtered[!is.na(filtered$age_at_diagnosis), ]) + 
    geom_bar(aes(age_at_diagnosis, na.rm = TRUE, fill = histologic_diagnosis)) +
    b + l + labs(x="age at diagnosis") + y.axis + 
    scale_fill_discrete(name = "histologic diagnosis", breaks = waiver(), 
    labels = labels) + 
    ggtitle("Histologic diagnosis distribution across different ages")
\label{diagnosis2}Figure 3.H: There is a representative sample of diagnosis over the age of patients.

Figure 3.H: There is a representative sample of diagnosis over the age of patients.

Even thought, thyroid cancer is commontly diagnosed at mature ages (~50) this plot shows that the classical papillary thyroid cancer has similar representation in all ages for this dataset.

As a summary from the previous plots, the most abundant population in the samples is the “white” and “not hispanic or latino”, so we decide to perform the analysis on this group of individuals, whose tumoral sample correspond to the most common cancer in thyroids: Papillary Thyroid Carcinoma(PTC).

Subseting

With this approach we want to avoid to include in the analysis, samples that introduce variability (mainly differential expression due to enviromental or epigenetic factors) as much as possible. To have more insight on the disease we want to have just paired samples, that is, samples from the same participant. So we filter the data accordingly:

# Paired data
paired <- intersect(thca[, thca$type == 'tumor']$bcr_patient_barcode, 
                    thca[, thca$type == 'normal']$bcr_patient_barcode)
paired_mask <- thca$bcr_patient_barcode %in% paired

filtered_filt <- filtered[paired_mask, ]

# Imposing the selected conditions
tumor_names <- row.names(subset(filtered_filt, 
    type == "tumor" &
    race == "WHITE" &
    histologic_diagnosis == "Thyroid Papillary Carcinoma - Classical/usual" &
    ethnicity == "NOT HISPANIC OR LATINO" ))

# Extracting the participants identifiers
participants <- unlist(lapply(tumor_names, substr, 9, 12))
participants <- participants[participants != ""]

check <- function(x, checklist){
    # Returning name of the sample just if the name is the same as in checklist
    a <- substr(x, 9, 12)
    if(a %in% checklist){
        return(x)
    }
}

# Extracting sample names of tumor and normal data.
sample_names <- unlist(lapply(row.names(filtered_filt), check, 
                              checklist = participants))
thca.filtered <- thca[, sample_names]

We explore again the variables for this subset. In this case, we show just some of the relationship plots we have seen previously:

filtered_filt <- filtered_filt[row.names(filtered_filt) %in% sample_names, ]
mplot <- ggplot(filtered_filt) + y.axis
mplot + geom_bar(aes(type, fill = gender)) + b +
  ggtitle("Ethnicity and type of sample after filtering")
\label{pairedg}Figure 4.A: The samples are consistent with the paired data.

Figure 4.A: The samples are consistent with the paired data.

This plot just checks whether we have the same number of cases in normal sample and tumor sample so that we have done correctly the filtering.

mplot + facet_wrap(~ gender) + geom_bar(aes(type, fill = ethnicity)) + b +
  ggtitle("Hispanic origin, gender and type of sample after filtering")
\label{pairede}Figure 4.B: The ethnicity is also consistent with the filter applied

Figure 4.B: The ethnicity is also consistent with the filter applied

Hispanic origin data is only available for tumor samples but as we have mentioned before we can extrapolate ethinicity and origin to the normal samples because they are paired (see below).

mplot + geom_bar(aes(x=gender, fill = ethnicity)) + b + l +
  ggtitle("Gender samples and hispanic origin after filtering")
\label{pairedge}Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have 'NA' as ethnicity

Figure 4.C: The normal samples are perfectly correlated with the number of tumor samples, despite normal samples have ‘NA’ as ethnicity

Same proportion of being or not being “hispanic or latino” in females and males. Subset filtering has not modified the fact that we have more female samples than males.

mplot + geom_bar(aes(histologic_diagnosis, fill = gender)) + b + l + 
    scale_x_discrete(name = "histologic diagnosis", breaks=waiver(), 
    labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Gender distribution across different types of hystologic diagnosis
            after filtering")
\label{paireda}Figure 4.D: This graph shows that the paired data filtering is consistent

Figure 4.D: This graph shows that the paired data filtering is consistent

Papillary Classical Thyroid cancer in the same proportion as “NA” because we have the same number of normal cases and classical thyroid tumor cases.

ggplot(filtered_filt[!is.na(filtered_filt$age_at_diagnosis), ]) +
    geom_bar(aes(age_at_diagnosis, fill = histologic_diagnosis)) + b + l + y.axis +
    xlab("age at diagnosis") + scale_fill_discrete(name = "histologic diagnosis",
    breaks=waiver(), labels = c("Papillary - Classical", "NA")) + 
    ggtitle("Histologic diagnosis distribution across different ages after filtering")
\label{fig:subset.explore5}Figure 4.E: We have a epresentative sample of all ages.

Figure 4.E: We have a epresentative sample of all ages.

Number of cases across ages is more or less similar after the filtering exept for some ages that have more cases around 30, 34, 42 and 47.

Transforming counts to rpkm

To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package and create a `DGEList’ object.

library(edgeR)
# Normalization just with the selected samples
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = as.data.frame(mcols(thca.filtered)))
dge <- DGEList(counts = assays(thca)$counts, genes = as.data.frame(mcols(thca)))

Now calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.

assays(thca.filtered)$logCPM <- cpm(dge.subset, log=TRUE, prior.count=3)
assays(thca.filtered)$logCPM[1:5, 1:5]
##    TCGA.BJ.A28R.01A.11R.A16R.07 TCGA.BJ.A28W.01A.11R.A32Y.07
## 1                      3.014910                     3.895927
## 2                      8.681899                     9.333470
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.423408                     4.912253
##    TCGA.BJ.A2N7.01A.11R.A18C.07 TCGA.BJ.A2N8.01A.11R.A18C.07
## 1                      2.074429                     3.634778
## 2                      9.313603                     8.873442
## 9                     -4.164318                    -4.164318
## 10                    -4.164318                    -4.164318
## 12                     4.523368                     4.366761
##    TCGA.BJ.A2N9.01A.11R.A18C.07
## 1                      3.436989
## 2                      8.467420
## 9                     -4.164318
## 10                    -4.164318
## 12                     4.584505

Quality assessment and normalization

Library sizes

Let’s examine the library sizes in terms of total number of sequence read counts per sample. Figure 5 below shows library sizes per sample in increasing order.

\label{libsize1}Figure 5.A: Library sizes in increasing order of the whole dataset

Figure 5.A: Library sizes in increasing order of the whole dataset

Plotting for all samples makes it impossible to see accurately the blue and red colors corresponding to tumor and normal respectively.

\label{libsize2}Figure 5.B: Library sizes in increasing order of the paired data.

Figure 5.B: Library sizes in increasing order of the paired data.

After the filtering the subset plot allows one to see that there is similar heterogeneity in tumor and normal library sizes. In addition, this figure reveals substantial differences in sequencing depth between samples and we may consider discarding those samples whose depth is substantially lower than the rest. To identify which are these samples we may simply look at the actual numbers including portion of the sample identifier that distinguishes them.

sampledepth <- round(dge.subset$sample$lib.size / 1e6, digits = 1)
names(sampledepth) <- substr(sample_names, 6, 12)

qqnorm(sampledepth)
qqline(sampledepth)
\label{libsizen}Figure 6: Distribution of library size compared with the normal distribution.

Figure 6: Distribution of library size compared with the normal distribution.

Looking into the qqplot we decide to filter out those below 40 but keeping only those that are paired:

sample_names <- sample_names[sampledepth > 40]

keep.paired <- function(pattern, x){
    # Return the names if it is paired
    pos <- grep(pattern, x, fixed = TRUE)
    if (length(pos) > 1){
        return(x[pos])
    }
}

sample_names <- unique(unlist(lapply(substr(sample_names, 9, 12), 
                                     keep.paired, sample_names)))

thca.filtered <- thca.filtered[, colnames(thca.filtered) %in% sample_names]
dge.subset <- DGEList(counts = assays(thca.filtered)$counts, 
                      genes = mcols(thca.filtered))
## Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
## Arguments in '...' ignored

Distribution of expression levels among samples

Let’s look at the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately, and are shown in Figure 7.

library(geneplotter)
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(assays(thca.filtered[, 
            thca.filtered$type ==  "tumor"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", 
            las = 1)
multidensity(as.list(as.data.frame(assays(thca.filtered[,
            thca.filtered$type ==  "normal"])$logCPM)),
            xlab = "log 2 CPM", legend = NULL, main = "Normal samples",
            las = 1)
\label{distRawExp}Figure 7: Non-parametric density distribution of expression profiles per sample.

Figure 7: Non-parametric density distribution of expression profiles per sample.

These two spikes are due to the lowly expressed genes and the highly expressed genes, but in the lowly expressed genes the effect of GC content and the read length may have higher impact.

GC and length effect

To see the impact of GC and length on the number of read we plot them:

gc_len <- rowRanges(thca.filtered)
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txgc)
## [1] -0.01173666
cor(rowMeans(assays(thca.filtered)$counts), gc_len$txlen)
## [1] -0.0008702022
par(mfrow=c(1,2))
plot(log10(rowMeans(assays(thca.filtered)$counts)), gc_len$txgc, 
     main = "GC content over expression", xlab = "Expression (log10)", 
     ylab = "GC")
plot(log10(rowMeans(assays(thca.filtered)$counts)), log10(gc_len$txlen), 
     main = "Length over the expression", xlab = "Expression (log10)", 
     ylab = "Length (log10)")
\label{gc_length}Figure 8: Relationship between GC content and length with the mean expression of genes.

Figure 8: Relationship between GC content and length with the mean expression of genes.

There is a low correlation on the counts and the GC content or the length but as there are clearly two groups based on the length, so we will normalize also using the length with the package cqn:

library("cqn")
cqn.subset <- cqn(assays(thca.filtered)$counts, x = gc_len$txgc, lengths = gc_len$txlen)
par(mfrow = c(1, 2))
boxplot(cqn.subset$func1, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of GC")
boxplot(cqn.subset$func2, names = FALSE, xlab = "Samples", 
        main = "Estimated effect of length")
\label{cqn}Figure 9: The effect of GC and length on each sample.

Figure 9: The effect of GC and length on each sample.

As we thought the effect of the length is higher than the effect of GC content

We can see if the distribution of expression values per sample in terms of logarithmic CPM units. Due to the large number of samples, we display tumor and normal samples separately:

cqn.logCPM <- cqn.subset$y + cqn.subset$offset
assays(thca.filtered)$cqn.logCPM <- cqn.logCPM
par(mfrow = c(1, 2))
multidensity(as.list(as.data.frame(
    cqn.logCPM[,thca.filtered$type ==  "tumor"])),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    cqn.logCPM[, thca.filtered$type ==  "normal"])),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
\label{distExp}Figure 10: Non-parametric density distribution of expression profiles per sample.

Figure 10: Non-parametric density distribution of expression profiles per sample.

These “after-normalization” plots are quite better than the previous ones used without GC and length normalization on the second spike. With this normalization we do not appreciate substantial differences between the samples in the distribution of expression values. However, below 0 there is a lot of variability.

Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 9 shows the distribution of those values across genes.

par(mfrow = c(1, 2))
avgexp <- rowMeans(assays(thca.filtered)$logCPM)
hist(avgexp, xlab = "log2 CPM", main = "Using DGEList normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)

# Using the corrected logCPM
avgexp.cqn <- rowMeans(assays(thca.filtered)$cqn.logCPM)
hist(avgexp.cqn, xlab = "log2 CPM", main = "Using cqn normalization", las = 1)
abline(v = 1, col = "red", lwd = 2)
\label{exprdist}Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization

Figure 11: Distribution of average expression level per gene using DGE normalization and cqn normalization

Filtering of lowly-expressed genes

In the light of previous plot, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes.

mask <- avgexp.cqn > 1
sum(mask)
## [1] 11801
thca.filtered <- thca.filtered[mask, ]
dge.subset <- dge.subset[mask, ]

Normalization

Since we will use the offsets computed by cqn, there is no need to normalize using the normalization tools from edgeR, such as calcNormFactors according to cqn vignette.

dge.subset$offset <- cqn.subset$glm.offset

dge.norm <- calcNormFactors(dge.subset)
assays(thca.filtered)$n.logCPM <- cpm(dge.norm, log = TRUE, prior.count = 3)

par(mfrow = c(3, 2))

# Not normalized but filtered
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalize with calcNormFactors after filtering out those
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$n.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)

# Normalized with cqn and filtered the low values:
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
     xlab = "log 2 CPM", legend = NULL, main = "Tumor samples", las = 1)
multidensity(as.list(as.data.frame(
    assays(thca.filtered[,thca.filtered$type ==  "tumor"])$cqn.logCPM)),
    xlab = "log 2 CPM", legend = NULL, main = "Normal samples", las = 1)
\label{norm}Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization

Figure 12: Tumor and Normal samples after filtering, without normalization, with calcNormFactors normalization and with cqn normalization

The normalization after filtering lowly-expressed genes with ‘cqn’ package clusters samples into one thin gaussian curve whereas normalization with calcNormFactors makes it thiker.

MA-plots

Tumoral samples

We examine now the MA-plots of the normalized expression profiles. We look first to the tumor samples.

par(mfrow = c(8, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==  "tumor"]
dgetmp <- dge.subset[, thca.filtered$type ==  "tumor"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}
\label{maPlotsTumor}Figure 13.A: MA-plots of the tumor samples.

Figure 13.A: MA-plots of the tumor samples.

\label{maPlotsTumor}Figure 13.A: MA-plots of the tumor samples.

Figure 13.A: MA-plots of the tumor samples.

We do not observe samples with major expression-level dependent biases.

Healthy samples

Let’s look now to the normal samples:

par(mfrow = c(8, 3), mar = c(4, 3, 4, 1))
setmp <- thca.filtered[, thca.filtered$type ==   "normal"]
dgetmp <- dge.subset[, thca.filtered$type ==  "normal"]
for (i in 1:ncol(setmp)) {
  A <- rowMeans(assays(setmp)$cqn.logCPM)
  M <- assays(setmp)$cqn.logCPM[, i] - A
  samplename <- substr(as.character(setmp$bcr_patient_barcode[i]), 1, 12)
  smoothScatter(A, M, main = samplename, las = 1)
  abline(h = 0, col = "blue", lwd = 2)
  lo <- lowess(M ~ A)
  lines(lo$x, lo$y, col = "red", lwd = 2)
}
\label{maPlotsNormal}Figure 13.B: MA-plots of the normal samples.

Figure 13.B: MA-plots of the normal samples.

\label{maPlotsNormal}Figure 13.B: MA-plots of the normal samples.

Figure 13.B: MA-plots of the normal samples.

We do not observe either important expression-level dependent biases among the normal samples except on the sample TCGA-BJ-A3PR. Which we proceed to remove it from the cqn.logCP:

mask <- -grep("TCGA.BJ.A3PR", colnames(thca.filtered))
thca.filtered <- thca.filtered[,mask]
sample_names <- sample_names[mask]
dge.subset <- dge.subset[,mask]

Thus the remaining samples are 52 from 26 patients.

Batch identification

We will search now for potential surrogate of batch effect indicators. Given that each sample names corresponds to a TCGA barcode (see the wiki), following the strategy described here we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(sample_names, 6, 7)
table(tss)
## tss
## BJ EL ET H2 
## 13 33  2  4
samplevial <- substr(sample_names, 14, 16)
table(samplevial)
## samplevial
## 01A 11A 11C 
##  26  25   1
portion <- substr(sample_names, 18, 19)
table(portion)
## portion
## 11 12 21 22 31 
## 42  6  2  1  1
table(substr(sample_names, 20, 20))
## 
##  R 
## 52
plate <- substr(sample_names, 22, 25)
table(plate)
## plate
## A16R A180 A18C A19O A20F A21D A220 A22L A23N 
##    2    2    5    2   10    6    6    7   12
center <- substr(sample_names, 27, 28)
table(center)
## center
## 07 
## 52

We can observe that the samples where collected at 4 tissues source sites but the majority of them at University of Pittsburg (BJ) and MD Anderson(EL). However,the manager center of the data is the same (University of North Carolina) and the same metabolite (RNA). In addition, they come from 7 different plates more or less distributed. However, most samples came from 2 out of 5 different portion and 2 out of 3 vials.

We are going to check if the previous variables of the barcode are surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with the variables:

table(data.frame(TYPE = thca.filtered$type, TSS = tss))
##         TSS
## TYPE     BJ EL ET H2
##   normal  0 20  2  4
##   tumor  13 13  0  0
# We will check for the other variables
table(data.frame(TYPE = thca.filtered$type, PLATE = plate))
##         PLATE
## TYPE     A16R A180 A18C A19O A20F A21D A220 A22L A23N
##   normal    0    2    0    0    3    0    2    7   12
##   tumor     2    0    5    2    7    6    4    0    0
table(data.frame(TYPE = thca.filtered$type, Portion = portion))
##         Portion
## TYPE     11 12 21 22 31
##   normal 18  4  2  1  1
##   tumor  24  2  0  0  0
table(data.frame(TYPE = thca.filtered$type, Vial = samplevial))
##         Vial
## TYPE     01A 11A 11C
##   normal  12  13   1
##   tumor   14  12   0

From the tables we can already see that all of them can be candidates as indicators for batch effect as there is no homogeneity of the number of tumor and normal samples across any of the variables.

To double check if they have some effect on the batch we make the hierarchical clustering of them. First, we set some functions to plot the hierarhcical clustering:

help.dendogram <- function(x, batch, labels) {
    # Helper function to plot dendograms
    if (is.leaf(x)) {
        ## color by batch
        attr(x, "nodePar") <- list(lab.col = as.vector(batch[attr(x, "label")])) 
        ## label by outcome
        attr(x, "label") <- as.vector(labels[attr(x, "label")])
    }
    x
}

plot.batch <- function(dendrogram, batching, se, out){
    # Function to see the batch
    # Given a sampleDendogram use batching to colour the leafs by sample.names of se
    sample.names = colnames(se)
    batch <- as.integer(factor(batching))
    names(batch) <- sample.names
    i.dendrogram <- dendrapply(dendrogram, help.dendogram, batch, out)
    plot(i.dendrogram, main = "Hierarchical clustering")
    legend("topright", paste("Batch", levels(factor(batching))), fill = sort(unique(batch)))
}

With this functions to plot the clustering with different labelings is easier:

d <- as.dist(1-cor(assays(thca.filtered)$cqn.logCPM, method = "spearman"))
sampleClustering <- hclust(d)
sampleDendrogram <- as.dendrogram(sampleClustering, hang = 0.1)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                        as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)
par(mfrow = c(1,1))
plot.batch(sampleDendrogram, portion, thca.filtered, outcome)
\label{clustering.n1}Figure 14.A: Hierarchical clustering of the samples by portion

Figure 14.A: Hierarchical clustering of the samples by portion

plot.batch(sampleDendrogram, samplevial, thca.filtered, outcome)
\label{clustering.n2}Figure 14.B: Hierarchical clustering of the samples by sample vial

Figure 14.B: Hierarchical clustering of the samples by sample vial

plot.batch(sampleDendrogram, tss, thca.filtered, outcome)
\label{clusering.n3}Figure 14.C: Hierarchical clustering of the samples by TSS

Figure 14.C: Hierarchical clustering of the samples by TSS

plot.batch(sampleDendrogram, as.character(thca.filtered$type), thca.filtered, outcome)
\label{clustering.n5}Figure 14.D: Hierarchical clustering of the samples by type

Figure 14.D: Hierarchical clustering of the samples by type

plot.batch(sampleDendrogram, plate , thca.filtered, outcome)
\label{clustering.n5}Figure 14.E: Hierarchical clustering of the samples by prospective collection

Figure 14.E: Hierarchical clustering of the samples by prospective collection

We can see that the sample from A3PR-normal clusters with the tumoral ones, and the A3H2 tumoral clusters with the healthy patients (where?). There might be a batch effect on the samplevial, that is the order of portion in a sequence of 100 - 120 mg sample portions (not clear). We can also observe that there are two groups of tumoral samples, and one of them clusters with one sample from a healthy person (indicate graph). We can see with a PCA that the tumoral and normal cells don’t create two clearly separated groups:

batch_effects <- list(portion, tss, as.character(thca.filtered$type),
                      as.character(thca.filtered$gender))
for (batching in batch_effects){
    batch <- as.integer(factor(batching))
    names(batch) <- sample_names
    plotMDS(dge.subset, labels = outcome, col = batch)
    legend("bottomleft", paste("Batch", levels(factor(batching))),
           fill = sort(unique(batch)), inset = 0.05, cex=0.7)
}
\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

\label{mds}Figure 15: Multidimensional scaling plot of the samples.

Figure 15: Multidimensional scaling plot of the samples.

We can confirm that there isn’t a clear group of normal and tumors and some tumoral samples do mix with the normal ones. And it seems that the females have more variance between healthy and tumoral.

Removing batch effect

ComBat

After looking for batch effect we couldn’t find a clear cause of the batch effect. We suspect that sample vial has something to do. However, we try to see if we can improve the clustering by using method ComBat for sample vial.

library("sva")

# Create the design matrix involving tumor and normal samples and males and females
treat <- factor(paste(thca.filtered$type, thca.filtered$gender, sep = "."))
paired <- as.numeric(as.factor(thca.filtered$bcr_patient_barcode))
design <- model.matrix(~0 + treat + paired)
colnames(design) <- c(levels(treat), "paired")

combatexp <- ComBat(assays(thca.filtered)$cqn.logCPM, tss, mod = design[, -1])
## Found 4 batches
## Adjusting for 4 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
d <- as.dist(1 - cor(combatexp, method = "spearman"))
cluster.combat <- hclust(d)
cluster.combat <- as.dendrogram(cluster.combat, hang = 0.1)
names(batch) <- colnames(thca.filtered)
outcome <- paste(substr(colnames(thca.filtered), 9, 12),
                            as.character(thca.filtered$type), sep = "-")
names(outcome) <- colnames(thca.filtered)

plot.batch(cluster.combat, samplevial, thca.filtered, outcome)
\label{clustering.combat}Figure 16.A: Hierarchical clustering of the samples with ComBat correction.

Figure 16.A: Hierarchical clustering of the samples with ComBat correction.

plot.batch(cluster.combat, thca.filtered$gender, thca.filtered, outcome)
\label{clustering.combat2}Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by gender.

Figure 16.B: Hierarchical clustering of the samples with ComBat correction coloured by gender.

plot.batch(cluster.combat, thca.filtered$type, thca.filtered, outcome)
\label{clustering.combat3}Figure 16.C: Hierarchical clustering of the samples with ComBat coloured by type.

Figure 16.C: Hierarchical clustering of the samples with ComBat coloured by type.

We can observe a slightly improvement, but still some tumoral samples cluster together with the normal ones.

QR Decomposition

To improve the batch effect removal we look if QR decomposition yields better results. In this case, we look for sample vial and TSS batch effects removal. To do so, we have to built the full model which would include gender and type as fixed effects.

library("limma")

qrexp <- removeBatchEffect(assays(thca.filtered)$cqn.logCPM, tss, design = design)
d <- as.dist(1 - cor(qrexp, method = "spearman"))
cluster.qr <- hclust(d)
cluster.qr <- as.dendrogram(cluster.qr, hang = 0.1)

plot.batch(cluster.qr, samplevial, thca.filtered, outcome)
\label{clustering.QRdecomposition.1}Figure 17.A: Hierarchical clustering of the samples with QR descompostion for sample vial

Figure 17.A: Hierarchical clustering of the samples with QR descompostion for sample vial

plot.batch(cluster.qr, tss, thca.filtered, outcome)
\label{clustering.QRdecomposition.2}Figure 17.B: Hierarchical clustering of the samples with QR descompostion for TSS

Figure 17.B: Hierarchical clustering of the samples with QR descompostion for TSS

plot.batch(cluster.qr, thca.filtered$gender, thca.filtered, outcome)
\label{clustering.QRdecomposition.3}Figure 17.C: Hierarchical clustering of the samples with QR descompostion for gender

Figure 17.C: Hierarchical clustering of the samples with QR descompostion for gender

plot.batch(cluster.qr, thca.filtered$type, thca.filtered, outcome)
\label{clustering.QRdecomposition.4}Figure 17.D: Hierarchical clustering of the samples with QR descompostion for type

Figure 17.D: Hierarchical clustering of the samples with QR descompostion for type

plot.batch(cluster.qr, tss, thca.filtered, outcome)
\label{clustering.QRdecomposition.4}Figure 17.D: Hierarchical clustering of the samples with QR descompostion for type

Figure 17.D: Hierarchical clustering of the samples with QR descompostion for type

With QR decomposition we can observe that just one normal sample clusters with tumoral ones, and we can’t observe other factor influencing this clustering.

Removing batch effect with SVD

We further try if SVD returns a better result:

library("corpcor")
s <- fast.svd(t(scale(t(assays(thca.filtered)$cqn.logCPM),
                      center = TRUE, scale = TRUE)))
pcSds <- s$d
pcSds[1] <- 0
svdexp <- s$u %*% diag(pcSds) %*% t(s$v)
colnames(svdexp) <- colnames(thca.filtered)
class(svdexp)
## [1] "matrix"
dim(svdexp)
## [1] 11801    52
d <- as.dist(1 - cor(svdexp, method = "spearman"))
cluster.svd <- hclust(d)
cluster.svd <- as.dendrogram(cluster.svd, hang = 0.1)
names(batch) <- colnames(thca.filtered)

plot.batch(cluster.svd, samplevial, thca.filtered, outcome)
\label{clustering.svd1}Figure 18.A: Hierarchical clustering of the samples with SVD for sample vial

Figure 18.A: Hierarchical clustering of the samples with SVD for sample vial

plot.batch(cluster.svd, tss, thca.filtered, outcome)
\label{clustering.svd2}Figure 18.B: Hierarchical clustering of the samples with SVD for TSS

Figure 18.B: Hierarchical clustering of the samples with SVD for TSS

plot.batch(cluster.svd, thca.filtered$type, thca.filtered, outcome)
\label{clustering.svd3}Figure 18.C: Hierarchical clustering of the samples with SVD for type

Figure 18.C: Hierarchical clustering of the samples with SVD for type

SVD clusters together normal and tumoral samples, maybe because some of them are quite similar according to the PCA previously performed. After all the process the best one, that separates better tumor and normal samples is the qr decomposition, which will be used for following-up analysis.

assays(thca.filtered)$qrexp <- qrexp
assays(thca.filtered)$combatexp <- combatexp

Initial assesment on differential expression

We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva. We build the null model with just the intercept (=1) in order to compare it with the previous full model (including gender and type).

mod0 <- model.matrix(~ 1 + type + paired, data = colData(thca.filtered))
pv <- f.pvalue(assays(thca.filtered)$combatexp, design, mod0)
sum(p.adjust(pv, method = "fdr") < 0.01)
## [1] 15

There are 15 genes changing significantly their expression at FDR < 1%. In Figure 18 below we show the distribution of the resulting p-values.

hist(pv, main = "Histogram of p-values", las = 1)
\label{pdist}Figure 19.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 19.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

hist(pv, main = "Histogram of p-values", xlim = c(0, 0.1), breaks = 80)
\label{pdist2}Figure 18.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 18.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Surrogate variables with sva

Now, let’s estimate surrogate variables using the sva() function.

sv <- sva(assays(thca.filtered)$combatexp, design, mod0)
## Number of significant surrogate variables is:  9 
## Iteration (out of 5 ):1  2  3  4  5
sv$n
## [1] 9

The SVA algorithm has found 9 surrogate variables. Let’s use them to assess against the extent of differential expression this time adjusting for these surrogate variables.

modsv <- cbind(design, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(thca.filtered)$combatexp, modsv, mod0sv)
sum(p.adjust(pvsv, method = "fdr") < 0.01)
## [1] 21

We have increased the number of changing genes to 21. Figure 19 shows the resulting distribution of p-values.

hist(pvsv, main = "Histogram of p-values adjusting surrogate variables", 
     las = 1)
\label{psvdist}Figure 20.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 20.A: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

hist(pvsv, main = "Histogram of p-values adjusting surrogate variables", 
     las = 1, xlim = c(0, 0.1), breaks = 80)
\label{psvdist2}Figure 20.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 20.B: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

After this preliminary analysis, we can conclude that aproximatelly 4300 genes have different expression levels among tumor and normal samples.

Removing surrogate variables effect

Instead of working with the surrogate variables in the model, we remove them for commodity to from latter on the contrasts.

# Code from https://www.biostars.org/p/121489/#121500

cleanY = function(y, mod, svs) {
    X = cbind(mod, svs)
    Hat = solve(t(X) %*% X) %*% t(X)
    beta = (Hat %*% t(y))
    rm(Hat)
    gc()
    P = ncol(mod)
    return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

e.no_surrogates <- cleanY(assays(thca.filtered)$combatexp, design, sv$sv)

The object e.no_surrogates contains the expression after removing the surrogates variables.

Functional Annotation

We want ensembl annotation, and gene symbol of our probes, and we want also the gene ontologies from all the human genes (will be later used for functional enrichment):

library("org.Hs.eg.db")
## 
annot <- select(org.Hs.eg.db, 
                columns = c("SYMBOL", "ENSEMBL"),
                keys = rownames(thca.filtered),
                keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
all.Human.GO <- select(org.Hs.eg.db,
                       columns = c("GO", "SYMBOL"), 
                       key = keys(org.Hs.eg.db, keytype = "ENTREZID"), 
                       keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
annot.Entrez.GO <- select(org.Hs.eg.db, 
                columns = c("GO", "SYMBOL"),
                keys = rownames(thca.filtered),
                keytype = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns

The different objects contain different information: annot contains the Symbol and Ensembl identifier from the remaining genes, all.Human.GO is the object with Gene Ontologies and Symbols for all human genes, while anot.Entrez.GO has the GO and the symbols for the genes.

Differential expression of paired data

Using the expression values with batch effect removed by qr decomposition and without the surrogate variables, we proceed with the voom normalization

# Transform the log2 normalized by cqn to counts without the surrogate variables
n_counts <- apply(e.no_surrogates, 1:2, function(x){2^(x)})

# Normalize by mean variance with voom
vo <- voom(n_counts, design, plot = TRUE)
\label{voom} Figure 21: The effect of the normalization by voom.

Figure 21: The effect of the normalization by voom.

fit <- lmFit(vo, design)

# Make the comparisons we are interested in based on our design
cm <- makeContrasts(
    FemaleDiseaseVsFemaleNormal = tumor.FEMALE - normal.FEMALE,
    MaleDiseaseVsMaleNormal = tumor.MALE - normal.MALE,
    MaleDiseaseVsFemaleDisease = tumor.MALE - tumor.FEMALE,
    MaleNormalVsFemaleNormal = normal.MALE - normal.FEMALE,
    NormalVsDisease = (tumor.MALE+tumor.FEMALE) - (normal.MALE+normal.FEMALE),
    FemaleVsMale = (tumor.FEMALE+normal.FEMALE) - (tumor.MALE+normal.MALE),
    levels = design)

fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

# Decide the threshold of differentially expressed genes
DE_threshold <- 4
lfc_threshold <- log2(DE_threshold)
res <- decideTests(fit2, lfc = lfc_threshold)
summary(res)
##    FemaleDiseaseVsFemaleNormal MaleDiseaseVsMaleNormal
## -1                          56                      42
## 0                        11702                   11721
## 1                           43                      38
##    MaleDiseaseVsFemaleDisease MaleNormalVsFemaleNormal NormalVsDisease
## -1                          0                        0             296
## 0                       11799                    11800           11157
## 1                           2                        1             348
##    FemaleVsMale
## -1            5
## 0         11795
## 1             1

We decided a threshold of differentially expressed genes of 4 (thus log2 of it is 2), meaning that it must be 4 times more expressed in one group than in other in orther to pass the filter. Based on these we are only interested in four contrast which will be annotated with symbol and ensembl id:

tt_tumor <- topTable(fit2, number = Inf, coef = "NormalVsDisease", p.value = 0.05)
tt_gender <- topTable(fit2, number = Inf, coef = "FemaleVsMale", p.value = 0.05)
tt_females <- topTable(fit2, number = Inf, coef = "FemaleDiseaseVsFemaleNormal", p.value = 0.05)
tt_males <- topTable(fit2, number = Inf, coef = "MaleDiseaseVsMaleNormal", p.value = 0.05)
    
# Annotate with the gene symbol
tt_tumor <- merge(tt_tumor, annot, by.x = 0, by.y = "ENTREZID", 
                  all.x = T, all.y = F)
tt_gender <- merge(tt_gender, annot, by.x = 0, by.y = "ENTREZID", 
                   all.x = T, all.y = F)
head(tt_tumor)
##   Row.names      logFC  AveExpr         t      P.Value    adj.P.Val
## 1         1  1.7175859 2.951440  6.641791 2.139241e-08 9.998091e-08
## 2       100  0.8244208 2.676940  3.680390 5.685932e-04 1.149755e-03
## 3     10002 -0.9442753 8.376500 -5.879680 3.317381e-07 1.224919e-06
## 4 100033417  2.3595682 7.973443  5.694818 6.414342e-07 2.254255e-06
## 5 100033424 -0.8995250 3.498590 -8.707140 1.316958e-11 1.323801e-10
## 6 100033426  0.4541972 6.307925  6.838136 1.052762e-08 5.259798e-08
##            B      SYMBOL         ENSEMBL
## 1  8.9612148        A1BG ENSG00000121410
## 2 -0.9106914         ADA ENSG00000196839
## 3  5.8437712       NR2E3 ENSG00000278570
## 4  5.1858685  SNORD116-5 ENSG00000207191
## 5 16.2091403 SNORD116-12 ENSG00000207197
## 6  9.2304799 SNORD116-14 ENSG00000206621

Accuracy

Diagnostics plot:

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_gender$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 6], df = fit2$df.prior + fit2$df.residual[6], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 6], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females and males patients", 
      outer = TRUE, cex = 1.5, side = 3, line = -2)
\label{accuracy}Figure 22.A: P-value distribution and comparison with the t-student distribution in the comparison Males vs Females.

Figure 22.A: P-value distribution and comparison with the t-student distribution in the comparison Males vs Females.

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_tumor$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 5], df = fit2$df.prior + fit2$df.residual[5], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 5], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between healthy and tumoral patients", 
      outer = TRUE, cex = 1.5, side = 3, line = -2)
\label{accuracy2}Figure 22.B: P-value distribution and comparison with the t-student distribution in the comparison Tumoral and Healthy samples.

Figure 22.B: P-value distribution and comparison with the t-student distribution in the comparison Tumoral and Healthy samples.

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_females$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 1], df = fit2$df.prior + fit2$df.residual[1], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 1], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between females patients and healthy.", outer = TRUE, cex = 1.5, 
      side = 3, line = -2)
\label{accuracy3}Figure 22.C: P-value distribution and comparison with the t-student distribution in the female comparison tumoral and healthy samples.

Figure 22.C: P-value distribution and comparison with the t-student distribution in the female comparison tumoral and healthy samples.

pars <- par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt_males$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit2$t[, 2], df = fit2$df.prior + fit2$df.residual[2], main = "", pch = ".", cex = 3)
qqline(fit2$t[, 2], lwd = 2, col = "red")
abline(0, 1, lwd = 2)
mtext("Comparing genes between male patients and healthy.", outer = TRUE, cex = 1.5, 
      side = 3, line = -2)
\label{accuracy4}Figure 22.D: P-value distribution and comparison with the t-student distribution in the male comparison tumoral and healthy samples.

Figure 22.D: P-value distribution and comparison with the t-student distribution in the male comparison tumoral and healthy samples.

Volcano plots

We can see in volanco plots how look like the data

volcano <- function(data, title){
    # Plot a volcano plot of a toptable output of limma
    ggplot(data, aes(logFC, B, colour = adj.P.Val)) + geom_point() + b + 
        ggtitle(title) + geom_vline(xintercept = c(2, -2), col = "red")+ 
        geom_hline(yintercept = 4.6, col = "red")
}

volcano(tt_tumor, "Comparing healthy vs tumoral")
\label{volcano1}Figure 23.A: Volcano plot of the comparison healthy vs tumoral.

Figure 23.A: Volcano plot of the comparison healthy vs tumoral.

volcano(tt_gender, "Comparing males vs females")
\label{volcano2}Figure 23.B: Volcano plot of the comparison males vs females.

Figure 23.B: Volcano plot of the comparison males vs females.

volcano(tt_males, "Comparing males: healthy vs tumoral")
\label{volcano3}Figure 23.C: Volcano plot of the comparison healthy males vs tumoral males.

Figure 23.C: Volcano plot of the comparison healthy males vs tumoral males.

volcano(tt_females, "Comparing females: healthy vs tumoral")
\label{volcano4}Figure 23.C: Volcano plot of the comparison healthy fremales vs tumoral females.

Figure 23.C: Volcano plot of the comparison healthy fremales vs tumoral females.

Genes differentially expressed

The number of genes which are differentialed expressed can be calculated with:

de_tumor <- subset(tt_tumor, abs(logFC) >= lfc_threshold & B >= 4.6 )
de_gender <- subset(tt_gender, abs(logFC) >= lfc_threshold & B >= 4.6)
de_males <- subset(tt_males, abs(logFC) >= lfc_threshold & B >= 4.6)
de_females <- subset(tt_females, abs(logFC) >= lfc_threshold & B >= 4.6)

# Number of genes which pass the filter
de_genes <- c(nrow(de_tumor), nrow(de_gender), nrow(de_males), nrow(de_females))
names(de_genes) <- c("Normal vs Disease", "Females vs Males", "Males", "Females")
de_genes
## Normal vs Disease  Females vs Males             Males           Females 
##               582                 4                55                91

As we can see there aren’t any gene of the comparison male vs female that pass the filter. We can compare if in the different comparisons such genes remain allways up-regulated or down-regulated.

Exploration of differential expressed genes

But we create some function to make it easier to compare several lists:

up_down <- function(data,  names = NULL){
    # Classify the genes in up or down, names are the names of the genes and 
    # should be on the same order
    dif_genes <- ifelse(data$logFC > 0, "up", "down")
    if (is.null(names)){
        names(dif_genes) <- rownames(data)
    } else {
        names(dif_genes) <- names
    }
    
    dif_genes
}

classify <- function(regulated1, regulated2){
    # Check how much up regulated genes in one comparison are down regulated in others

    inter <- intersect(names(regulated1), names(regulated2))
    
    genes <- c()
    genes2 <- c()
    
    # Reorder them by the name for the comparison with table
    for (name in inter){
        genes <- c(genes, regulated1[name])
        genes2 <- c(genes2, regulated2[name])
    }
    
    # Perform the comparison
    table(genes, genes2, 
          dnn = c(deparse(substitute(regulated1)), deparse(substitute(regulated2))))
}

genes_names <- function(regulated1, regulated2){
    # Extract the names of those up-regulated, down-regulated or not clear in
    # both lists
    inter <- intersect(names(regulated1), names(regulated2))
    
    up <- c()
    down <- c()
    not_clear <- c()
    for (name in inter){
        if (regulated1[name] == regulated2[name]){
            if (regulated1[name] == "up"){
                up <- c(up, name)
            } else{
                down <- c(down, name)
            }
        } else{
            not_clear <- c(not_clear, name)
        }
    }
    
    list(up = up, down = down, not_clear = not_clear)
}

The function up_down classify the genes in up-regulated and down-regulated. The function classify given two lists of genes up or down regulates, find the genes common in them and check if in both lists are up-regulated or down-regulated, however it doesn’t display the names, just the numbers. However the function genes_names classify the genes on the three elements of the list: up if in both sets is up-regulated, down if in both sets the genes is down-regulated and not_clear if in a dataset the gene is up-regulated and in the other is down-regulated. We can see here the results of several comparisons:

r_tumor <- up_down(de_tumor)
r_females <- up_down(de_females)
r_males <- up_down(de_males)

# We collect the name of genes for further studies
genes_females_males <- genes_names(r_females, r_males)
genes_tumor_males <- genes_names(r_tumor, r_males)
genes_tumor_females <- genes_names(r_tumor, r_females)

# We see how do they overlap, more up-regulated, or more dow-regulated genes.
classify(r_tumor, r_males)
##        r_males
## r_tumor up
##    down  1
##    up    1
classify(r_tumor, r_females)
##        r_females
## r_tumor up
##    down  1
##    up    1
classify(r_females, r_males)
##          r_males
## r_females down up
##      down   19  0
##      up      0 19

We can observe that there are 1 genes which depending on the comparison they change the direction from up-regulated to dow-regulated or vice versa. We have already obtained list of genes classified by the behavior in two comparisons with genes_names function. Furthermore we can extract all the names of the other comparisons to see how much they overlap:

DE_tumor <- unique(rownames(de_tumor))
DE_males <- unique(rownames(de_males))
DE_females <- unique(rownames(de_females))

Venn diagrams

To visualize it we plot them with venn diagrams:

library("Vennerable")
plot(Venn(list(DE_tumor, DE_females),
              SetNames = c("Tumoral", "Females")))
\label{venn1}Figure 24.A: Representation of the number of genes differentially expressed in comparison between females, and healthy vs tumoral.

Figure 24.A: Representation of the number of genes differentially expressed in comparison between females, and healthy vs tumoral.

plot(Venn(list(DE_tumor, DE_males),
              SetNames = c("Tumoral", "Males")))
\label{venn2}Figure 24.B: Representation of the number of genes differentially expressed in comparison between males, and healthy vs tumoral.

Figure 24.B: Representation of the number of genes differentially expressed in comparison between males, and healthy vs tumoral.

plot(Venn(list(DE_females, DE_males),
              SetNames = c("Females", "Males")))
\label{venn3}Figure 24.C: Representation of the number of genes differentially expressed in comparison between males and females.

Figure 24.C: Representation of the number of genes differentially expressed in comparison between males and females.

As we can observe of the comparison, many genes differentially expressed on females comparing healthy samples and tumoral ones are not found on males.

Functional enrichment

We can perform a functional enrichment on our data with the GO already stored

library(GOstats)

go_obj <- function(genes_id, universe = unique(annot.Entrez.GO)){
    # Create an object for GOstat from ENTREZIDs
    new("GOHyperGParams", geneIds = genes_id,
                    universeGeneIds = universe,
                    annotation = "org.Hs.eg.db", ontology = "BP",
                    pvalueCutoff = 0.05, testDirection = "over", 
                    conditional = TRUE)
}

params_tumor <- go_obj(DE_tumor)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds
params_males <- go_obj(DE_males)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
params_females <- go_obj(DE_females)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
# The set of differentially expressed only in females
mask <- DE_females %in% intersect(DE_males, DE_females)
DE_females_exc <- DE_females[!mask]
DE_females_exc <- DE_females_exc[!DE_females_exc %in% intersect(DE_females_exc, DE_tumor)]
params_female_exc <- go_obj(DE_females_exc)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
# The set of differentially expressed only in males
mask <- DE_males %in% intersect(DE_males, DE_females)
DE_males_exc <- DE_males[!mask]
DE_males_exc <- DE_males_exc[!DE_males_exc %in% intersect(DE_males_exc, DE_tumor)]
params_male_exc <- go_obj(DE_males_exc)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
# We explore those genes that are shared among the comparisons
go_females_males_up <- go_obj(genes_females_males$up)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
go_females_males_down <- go_obj(genes_females_males$down)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
go_tumor_females_not_clear <- go_obj(genes_tumor_females$not_clear)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
go_tumor_males_not_clear <- go_obj(genes_tumor_males$not_clear)
## Warning in makeValidParams(.Object): converting univ from list to atomic
## vector via unlist

## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds

Once prepared with the object we can run the analysis:

go_analisys <- function(go_obj, annots.GO, size = 5, count = 5, pvalue = 0.05){
    hgOver <- hyperGTest(go_obj)
    
    GO <- GOstats::summary(hgOver)
    
    GO <- subset(GO, Size >= size & Count >= count & Pvalue <= pvalue)
    if (nrow(GO) == 0){
        warning("Not significant GO on this group with these settings.")
        return(NULL)
    }
    
    # Visualize the data
    plotgo <- ggplot(GO, aes(Size, Count, size = OddsRatio, colour = Pvalue)) + 
        geom_point() + b + ggtitle(deparse(substitute(go_obj)))
    
    GO <- GO[order(GO$OddsRatio, decreasing = TRUE), ]
    resume <- function(id, annot){
        #
        a <- unique(annot[annot$ENTREZID %in% id, "SYMBOL"])
        if (length(a) >= 1){
            return(a)
        } else{
            return("")
        }
    }
    geneIDs <- geneIdsByCategory(hgOver)[GO$GOBPID]
    geneSYMs <- sapply(geneIDs, resume, annot = annots.GO)
    if (is.matrix(geneSYMs)){
        geneSYMs <- apply(geneSYMs, 2, paste, collapse = ", ")
    } else {
        geneSYMs <- sapply(geneSYMs, paste, collapse = ", ")
    }
    GO <- merge(GO, geneSYMs, by.x = "GOBPID", by.y = 0)
    rownames(GO) <- 1:nrow(GO)
    colnames(GO) <- c(colnames(GO)[-ncol(GO)], "Genes")
    list(GO = GO, plot = plotgo)
}

GO_tumor <- go_analisys(params_tumor, annot.Entrez.GO)
if (!is.null(GO_tumor)){
    print(head(GO_tumor$GO))
    plot(GO_tumor$plot)
}
##       GOBPID      Pvalue OddsRatio ExpCount Count Size
## 1 GO:0000165 0.021076033  2.672912 2.773153     7  134
## 2 GO:0000186 0.028543100  2.497838 2.953263     7  136
## 3 GO:0000724 0.042270413  2.758534 1.910935     5   88
## 4 GO:0000902 0.004679631  2.554324 5.052855    12  258
## 5 GO:0000904 0.028724086  2.730559 2.322160     6  116
## 6 GO:0001655 0.004756597  3.022202 3.200310     9  150
##                                                      Term
## 1                                            MAPK cascade
## 2                            activation of MAPKK activity
## 3 double-strand break repair via homologous recombination
## 4                                      cell morphogenesis
## 5          cell morphogenesis involved in differentiation
## 6                           urogenital system development
##                                                                          Genes
## 1                             AMBP, CREB1, IGBP1, ITPKB, SMAD1, RPS6KA1, SORL1
## 2                              ARAF, CSF2RB, IL2RG, MARK3, PEBP1, PPP5C, PSMD8
## 3                                              ERCC4, RAD52, RFC2, UBE2V2, WRN
## 4 ADAM8, DIAPH1, DNAH5, EPB42, PTK2B, FGD1, HCK, IDUA, NTF4, RDX, STIL, DYNLT1
## 5                                           AXL, FGG, RBPJ, SMAD7, CITED1, PXN
## 6                  ANGPT2, ARG2, CUX1, EPHB2, GDNF, SMAD1, SMAD7, CITED1, SDC1
\label{tumor1}Figure 25.A: Representation of the P-value and effect of the GO of the comparison of healty and tumoral samples.

Figure 25.A: Representation of the P-value and effect of the GO of the comparison of healty and tumoral samples.

We should do the same if we had some genes differentially expressed on the comparison between gender.

GO_female_exc <- go_analisys(params_female_exc, annot.Entrez.GO)
if (!is.null(GO_female_exc)){
    print(head(GO_female_exc$GO))
    plot(GO_female_exc$plot)
}
##       GOBPID     Pvalue OddsRatio  ExpCount Count Size
## 1 GO:0000902 0.03956983  2.276179  3.953704     8  854
## 2 GO:0001934 0.04100166  2.574681  2.569444     6  555
## 3 GO:0007154 0.02847285  1.889602 16.430556    23 3549
## 4 GO:0007409 0.04784703  2.722641  2.000000     5  432
## 5 GO:0010562 0.02611418  2.653081  2.962963     7  640
## 6 GO:0023014 0.02207980  3.008439  2.222222     6  480
##                                                  Term
## 1                                  cell morphogenesis
## 2      positive regulation of protein phosphorylation
## 3                                  cell communication
## 4                                        axonogenesis
## 5 positive regulation of phosphorus metabolic process
## 6      signal transduction by protein phosphorylation
##                                                                                                                                                                      Genes
## 1                                                                                                                 COL6A1, CCL24, VAX2, EHD3, RTN4, CDH23, RASAL3, RASGEF1A
## 2                                                                                                                             CCL20, CCL24, RASAL3, INHBE, TRAF7, RASGEF1A
## 3 APLP1, BICD1, INPP1, CD46, PDE7A, SCTR, CCL20, CCL24, TAAR5, SIGMAR1, IFITM2, CRTC1, VAX2, IL21R, CD244, PCDHB6, RTN4, RALGAPA2, RASAL3, INHBE, TRAF7, CREB3L4, RASGEF1A
## 4                                                                                                                                     COL6A1, VAX2, RTN4, RASAL3, RASGEF1A
## 5                                                                                                                      CCL20, CCL24, CD244, RASAL3, INHBE, TRAF7, RASGEF1A
## 6                                                                                                                             CCL20, CCL24, RASAL3, INHBE, TRAF7, RASGEF1A
\label{tumor2}Figure 25.B: Representation of the P-value and effect of the GO of the comparison of differentially expressed genes exclusivly from females

Figure 25.B: Representation of the P-value and effect of the GO of the comparison of differentially expressed genes exclusivly from females

GO_male_exc <- go_analisys(params_male_exc, annot.Entrez.GO)
## Warning in go_analisys(params_male_exc, annot.Entrez.GO): Not significant
## GO on this group with these settings.
if (!is.null(GO_male_exc)){
    print(head(GO_male_exc$GO))
    plot(GO_male_exc$plot)
}
GO_females_males_up <- go_analisys(go_females_males_up, annot.Entrez.GO)
if (!is.null(GO_females_males_up)){
    print(head(GO_females_males_up$GO))
    plot(GO_females_males_up$plot)
}
##       GOBPID     Pvalue OddsRatio ExpCount Count Size
## 1 GO:0009888 0.00821382  4.265263 1.896825     6  956
## 2 GO:0022414 0.01125840  4.472160 1.432540     5  722
## 3 GO:0048869 0.01605609  3.121074 4.376984     9 2206
##                             Term
## 1             tissue development
## 2           reproductive process
## 3 cellular developmental process
##                                                           Genes
## 1                       CACNA1S, RBPJ, SOCS3, IL20, FA2H, VSIG1
## 2                               ADRA2C, RBPJ, DDO, SOCS3, SPAG6
## 3 ADRA2C, CACNA1S, RBPJ, SOCS3, SPAG6, IL20, AP1AR, FA2H, VSIG1
\label{tumor4}Figure 25.D: Representation of the P-value and effect of the GO of the up-regulated genes in both  male and female comparisons

Figure 25.D: Representation of the P-value and effect of the GO of the up-regulated genes in both male and female comparisons

GO_females_males_down <- go_analisys(go_females_males_down, annot.Entrez.GO)
if (!is.null(GO_females_males_down)){
    print(head(GO_females_males_down$GO))
    plot(GO_females_males_down$plot)
}
##       GOBPID     Pvalue OddsRatio ExpCount Count Size                Term
## 1 GO:0007165 0.04551559  2.737523 5.315807     9 3215 signal transduction
##                                                         Genes
## 1 BAG1, CXCL2, PCSK6, MAPK3, CRADD, CNTN6, OR13C5, DGKK, OPN5
\label{tumor5}Figure 25.E: Representation of the P-value and effect of the GO of the down-regulated genes in both  male and female comparisons

Figure 25.E: Representation of the P-value and effect of the GO of the down-regulated genes in both male and female comparisons

GO_tumor_females_not_clear <- go_analisys(go_tumor_females_not_clear, annot.Entrez.GO)
## Warning: No results met the specified criteria. Returning 0-row data.frame
## Warning in go_analisys(go_tumor_females_not_clear, annot.Entrez.GO): Not
## significant GO on this group with these settings.
if (!is.null(GO_tumor_females_not_clear)){
    print(head(GO_tumor_females_not_clear$GO))
    plot(GO_tumor_females_not_clear$plot)
}
GO_females <- go_analisys(params_females, annot.Entrez.GO)
if (!is.null(GO_females)){
    print(head(GO_females$GO))
    plot(GO_females$plot)
}
##       GOBPID     Pvalue OddsRatio  ExpCount Count Size
## 1 GO:0000904 0.00333100  2.743966  4.927249    12  596
## 2 GO:0001775 0.02662208  2.269251  4.290675     9  519
## 3 GO:0001934 0.03858000  2.110639  4.588294     9  555
## 4 GO:0002684 0.02252441  2.342149  4.166667     9  504
## 5 GO:0006935 0.01823511  2.435678  4.017857     9  486
## 6 GO:0007154 0.02731885  1.604748 29.340278    38 3549
##                                             Term
## 1 cell morphogenesis involved in differentiation
## 2                                cell activation
## 3 positive regulation of protein phosphorylation
## 4   positive regulation of immune system process
## 5                                     chemotaxis
## 6                             cell communication
##                                                                                                                                                                                                                                                                           Genes
## 1                                                                                                                                                                                        CACNA1S, COL6A1, RBPJ, MAPK3, VAX2, CNTN6, AP1AR, RTN4, CDH23, RASAL3, RASGEF1A, VSIG1
## 2                                                                                                                                                                                                                  ADRA2C, RBPJ, CD46, MAPK3, IL21R, CD244, RASAL3, DGKK, KLRF2
## 3                                                                                                                                                                                                             ADRA2C, MAPK3, CCL20, CCL24, IL20, RASAL3, INHBE, TRAF7, RASGEF1A
## 4                                                                                                                                                                                                                   CXCL2, CD46, MAPK3, CCL20, CCL24, IL20, CD244, RASAL3, GBP5
## 5                                                                                                                                                                                                          CACNA1S, COL6A1, CXCL2, MAPK3, CCL20, CCL24, CNTN6, RASAL3, RASGEF1A
## 6 ADRA2C, APLP1, BAG1, BICD1, CXCL2, RBPJ, INPP1, CD46, PCSK6, PDE7A, MAPK3, SCTR, CCL20, CCL24, CRADD, SOCS3, TAAR5, SIGMAR1, IFITM2, CRTC1, VAX2, CNTN6, IL20, IL21R, CD244, PCDHB6, RTN4, RALGAPA2, RASAL3, INHBE, TRAF7, PSD2, KCNG4, OR13C5, DGKK, CREB3L4, RASGEF1A, OPN5
\label{tumor7}Figure 25.G: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females comparison

Figure 25.G: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in females comparison

GO_males <- go_analisys(params_males, annot.Entrez.GO)
if (!is.null(GO_males)){
    print(head(GO_males$GO))
    plot(GO_males$plot)
}
##       GOBPID      Pvalue OddsRatio ExpCount Count Size
## 1 GO:0006955 0.021239350  2.454326 4.208995     9  888
## 2 GO:0009617 0.001648703  5.342207 1.289242     6  272
## 3 GO:0043207 0.025912660  2.888176 2.303571     6  486
##                                   Term
## 1                      immune response
## 2                response to bacterium
## 3 response to external biotic stimulus
##                                                  Genes
## 1 CXCL2, RBPJ, IK, MAPK3, VIP, SOCS3, IL20, IL22, GBP5
## 2               CXCL2, RBPJ, MAPK3, VIP, SOCS3, UGT1A1
## 3               CXCL2, RBPJ, MAPK3, VIP, SOCS3, UGT1A1
\label{tumor8}Figure 25.H: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in males comparison

Figure 25.H: Representation of the P-value and effect of the GO of the genes with the differentially expressed genes in males comparison

GO_tumor_males_not_clear <- go_analisys(go_tumor_males_not_clear, annot.Entrez.GO)
## Warning: No results met the specified criteria. Returning 0-row data.frame
## Warning in go_analisys(go_tumor_males_not_clear, annot.Entrez.GO): Not
## significant GO on this group with these settings.
if (!is.null(GO_tumor_males_not_clear)){
    print(head(GO_tumor_males_not_clear$GO))
    plot(GO_tumor_males_not_clear$plot)
}

We should look in details to those values and associations in the tables.

library(xtable)
go_table <- function(go_obj, ftitle){
    # Function to output the table in html format
    if (is.null(go_obj)){
        warning("Null Object: unable to creat the table")
    } else {
        xtab_tumor <- xtable(go_obj, align="l|c|r|r|r|r|r|p{3cm}|p{3cm}|")
        print(x=xtab_tumor, file = ftitle, type = "html")
    }
}

go_table(GO_tumor$GO, "GO_tumor_table.html")
go_table(GO_females$GO, "GO_females_table.html")
go_table(GO_males$GO, "GO_males_table.html")

go_table(GO_female_exc$GO, "GO_female_exc_table.html")
go_table(GO_male_exc$GO, "GO_male_exc_table.html")
## Warning in go_table(GO_male_exc$GO, "GO_male_exc_table.html"): Null Object:
## unable to creat the table
go_table(GO_females_males_up$GO, "GO_females_males_up_table.html")
go_table(GO_females_males_down$GO, "GO_females_males_down_table.html")

go_table(GO_tumor_females_not_clear$GO, "GO_tumor_females_not_clear_table.html")
## Warning in go_table(GO_tumor_females_not_clear$GO,
## "GO_tumor_females_not_clear_table.html"): Null Object: unable to creat the
## table
go_table(GO_tumor_males_not_clear$GO, "GO_tumor_males_not_clear_table.html")
## Warning in go_table(GO_tumor_males_not_clear$GO,
## "GO_tumor_males_not_clear_table.html"): Null Object: unable to creat the
## table

To explore more than the top 6 GO we should look into the tables generated with any browser. In GO_tumor_table we have the list of GO enrich on the differentially expressed genes in the comparison tumor vs healthy samples.
In GO_females_table we have the list of GO enrich on the differentially expressed genes in the comparison tumoral females vs healthy females.
In GO_males_table we have the list of GO enrich on the differentially expressed genes in the comparison between tumoral males samples and healthy males samples.
In GO_female_exc we have the list of GO enrich on the differentially expressed genes in the comparison Tumor vs Normals.
In GO_females_males_up we have the list of GO enrich on the differentially expressed genes which are shared between the males and the females and in both cases are up-regulated.
In GO_females_males_down we have the list of GO enrich on the differentially expressed genes which are shared between the males and the females and in both cases are down-regulated.
In GO_tumor_females_not_clear we have the list of GO enrich on the differentially expressed genes shared between the comparison of tumoral vs healthy samples and tumoral samples and healthy females.
In GO_tumor_males_not_clear we have the list of GO enrich on the differentially expressed genes shared between the comparison of tumoral vs healthy samples and tumoral samples and healthy males.

SessionInfo

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] xtable_1.8-2               GO.db_3.3.0               
##  [3] GOstats_2.38.0             graph_1.50.0              
##  [5] Category_2.38.0            Matrix_1.2-6              
##  [7] Vennerable_3.1.0.9000      org.Hs.eg.db_3.3.0        
##  [9] corpcor_1.6.8              sva_3.20.0                
## [11] genefilter_1.54.2          mgcv_1.8-12               
## [13] nlme_3.1-128               cqn_1.18.0                
## [15] quantreg_5.26              SparseM_1.7               
## [17] preprocessCore_1.34.0      nor1mix_1.2-1             
## [19] mclust_5.2                 geneplotter_1.50.0        
## [21] annotate_1.50.0            XML_3.98-1.4              
## [23] AnnotationDbi_1.34.3       lattice_0.20-33           
## [25] edgeR_3.14.0               limma_3.28.6              
## [27] ggplot2_2.1.0              SummarizedExperiment_1.2.2
## [29] Biobase_2.32.0             GenomicRanges_1.24.1      
## [31] GenomeInfoDb_1.8.1         IRanges_2.6.0             
## [33] S4Vectors_0.10.1           BiocGenerics_0.18.0       
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.1         colorspace_1.2-6       htmltools_0.3.5       
##  [4] yaml_2.1.13            survival_2.39-4        RBGL_1.48.1           
##  [7] DBI_0.4-1              RColorBrewer_1.1-2     plyr_1.8.4            
## [10] stringr_1.0.0          zlibbioc_1.18.0        MatrixModels_0.4-1    
## [13] munsell_0.4.3          gtable_0.2.0           evaluate_0.9          
## [16] labeling_0.3           knitr_1.13             GSEABase_1.34.0       
## [19] Rcpp_0.12.5            KernSmooth_2.23-15     scales_0.4.0          
## [22] formatR_1.4            XVector_0.12.0         digest_0.6.9          
## [25] stringi_1.1.1          grid_3.3.0             tools_3.3.0           
## [28] magrittr_1.5           RSQLite_1.0.0          rmarkdown_0.9.6       
## [31] AnnotationForge_1.14.2

Bibliography

  • Agrawal N, Akbani R, Aksoy BA, Ally A, Arachchi H, Asa SL, Zou L. (2014). Integrated Genomic Characterization of Papillary Thyroid Carcinoma. Cell, 159(3), 676-690.
  • The Canger Genome Atlas. Papillary Thyroid Carcinoma. Retrieved May 6, 2016. Available from: http://cancergenome.nih.gov/cancersselected/thyroid
  • Xing M. (2013). Molecular pathogenesis and mechanisms of thyroid cancer. Nature Reviews. Cancer, 13(3), 184-99.